Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework
Abstract
The Map/Reduce framework, a parallel processing paradigm, is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators such as join, Cartesian product, and set operations are difficult to implement with Map/Reduce. Map/Reduce processes homogeneous data streams easily but does not directly support multiple heterogeneous input data streams. Thus the binary relational join operator has no efficient implementation in the Map/Reduce framework. Some implementations of the join operator exist for the Hadoop distribution of the Map/Reduce framework, but they do not perform well on heavily skewed data. Skew in the input data degrades the performance of the join operator in a parallel environment, where data is distributed among parallel sites for independent joins. Data skew can severely limit the effectiveness of parallel architectures when some processing units (PUs) are overloaded during data distribution and hence take longer to complete than other PUs. This also wastes the resources of the idle PUs. As data skew occurs naturally in many applications, handling it is important for improving the performance of the join operation. We implement a hash join algorithm, a hybrid of Hadoop's map-side and reduce-side joins, that can handle skew, and we compare its performance with that of Hadoop's other join algorithms.
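The effect of skew on data distribution described above can be illustrated with a minimal sketch. It is not the algorithm implemented in the paper; it only counts how many records each processing unit would receive under plain hash partitioning. The function name `partition_loads`, the sample keys, and the use of a modulo in place of a real hash function are all assumptions made for illustration.

```python
from collections import Counter


def partition_loads(keys, num_units):
    """Count the records each processing unit receives under hash partitioning.

    Assumption: join keys are non-negative integers, and `key % num_units`
    stands in for the hash function a real partitioner would use.
    """
    loads = Counter()
    for key in keys:
        loads[key % num_units] += 1
    # Report loads in unit order, including units that received nothing.
    return [loads[unit] for unit in range(num_units)]


# Hypothetical skewed input: key 7 occurs 90 times, keys 0..9 once each.
skewed_keys = [7] * 90 + list(range(10))
# Hypothetical uniform input: 100 distinct keys.
uniform_keys = list(range(100))

skewed_loads = partition_loads(skewed_keys, 4)
uniform_loads = partition_loads(uniform_keys, 4)

print("skewed: ", skewed_loads)
print("uniform:", uniform_loads)
```

With the skewed input, every record carrying the frequent key lands on the same unit, so that unit does most of the join work while the others sit idle. With the uniform input, the load is spread evenly.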
Similar Resources
A Comparative Analysis of Join Algorithms Using the Hadoop Map/Reduce Framework
The Map/Reduce framework is a programming model recently introduced by Google Inc. to support distributed computing on very large datasets across a large number of machines. It provides a simple yet powerful way to implement distributed applications without deep knowledge of parallel programming. Each participating node executes Map and/or Reduce tasks which involve reading and wri...
Sentiment Analysis of Social Networking Data Using Categorized Dictionary
Sentiment analysis is the process of analyzing a person’s perception or belief about a particular subject matter. However, finding the correct opinion or interest in multi-faceted sentiment data is a tedious task. In this paper, a method is proposed to improve sentiment accuracy by using a categorized dictionary for sentiment classification and analysis. A categorized dictiona...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
Adaptive Join Plan Generation in Hadoop For CPS296.1 Course Project
Joins in Hadoop have always been a problem for its users: the Map/Reduce framework seems to be specifically designed for group-by aggregation tasks rather than across-table operations; on the other hand, the join operation in distributed database systems was never an easy task, because data location and skewness make join strategies harder to optimize. Fragment-replicate join (map join) may be a cle...
A Scalable and Skew-insensitive Algorithm for Join Operations using Map/Reduce Model
For over a decade, Map/Reduce has been a prominent programming model for handling vast amounts of raw data in large-scale systems. This model ensures scalability, reliability, and availability with reasonable query processing time. However, these large-scale systems still face some challenges: data skew, task imbalance, high disk I/O, and redistribution costs can have disastrous effects on...